Applied ML

Ben Lambert

Material

  • feature engineering
  • metrics for classification and regression
  • cross-validation

Feature engineering

What is feature engineering?

  1. conversion of raw data into a form amenable to ML estimation \(\implies\) data munging
  2. designing features that have predictive power \(\implies\) data visualisation

1 is necessary for all ML models; 2 matters most for non-deep-learning models

Example dataset: California house prices (Kaggle)

index longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
10283 -117.78 33.86 16 3471 708 1769 691 4.1064 246100 LESS 1H OCEAN
11497 -118.03 33.71 26 1483 251 738 235 6.0000 271400 LESS 1H OCEAN
6965 -118.06 33.99 47 1588 309 827 292 3.7833 166100 LESS 1H OCEAN
11645 -118.04 33.81 22 4057 624 2204 643 5.8527 241000 LESS 1H OCEAN
2012 -119.83 36.73 21 1702 358 1347 316 2.4137 62100 INLAND
13278 -117.64 34.09 34 2839 659 1822 631 3.0500 121300 INLAND
65 -122.30 37.81 48 828 182 392 133 2.5938 73500 NEAR BAY
17104 -122.23 37.45 34 4177 723 1586 660 5.0457 395100 NEAR OCEAN
13807 -117.01 34.90 36 2181 555 1404 492 2.3077 55500 INLAND
9796 -121.84 36.52 18 3165 533 1312 434 6.5234 357400 NEAR OCEAN
13346 -117.64 34.05 27 1407 362 1684 350 2.1944 95700 INLAND
7754 -118.11 33.91 19 3056 759 1561 740 3.1369 196900 LESS 1H OCEAN
2750 -115.56 32.76 15 1278 217 653 185 4.4821 140300 INLAND
4461 -118.18 34.10 10 1940 445 763 412 4.9750 166700 LESS 1H OCEAN
776 -122.10 37.65 25 2538 494 1185 501 4.5417 194400 NEAR BAY
13635 -117.34 34.08 33 4924 1007 3502 953 3.2330 99400 INLAND
8276 -118.16 33.78 15 4798 1374 3087 1212 2.1270 163300 NEAR OCEAN
20237 -119.27 34.27 52 459 112 276 107 2.3750 198400 NEAR OCEAN
493 -122.25 37.86 52 1587 444 878 449 1.7652 336800 NEAR BAY
11533 -118.09 33.77 26 5359 1508 1829 1393 1.7675 61300 LESS 1H OCEAN
11526 -118.05 33.72 22 5416 1271 2260 1184 3.8038 174500 LESS 1H OCEAN
11907 -117.39 33.97 48 1915 348 1060 376 3.4044 117900 INLAND
14710 -117.07 32.79 25 2489 314 911 309 7.8336 277600 NEAR OCEAN
17240 -119.69 34.43 30 1273 343 1082 325 2.5104 228100 LESS 1H OCEAN
10610 -117.79 33.68 13 2636 416 1137 404 7.2118 311500 LESS 1H OCEAN
18813 -121.41 40.82 16 2668 516 915 362 2.3393 90300 INLAND
5344 -118.44 34.04 31 2670 662 1535 631 3.0714 347800 LESS 1H OCEAN
6068 -117.86 34.02 19 6300 937 3671 943 5.9716 262100 LESS 1H OCEAN
20396 -118.85 34.25 17 5593 732 1992 660 7.2965 342900 LESS 1H OCEAN
7075 -117.98 33.94 32 2562 491 1222 446 4.0985 226200 LESS 1H OCEAN
6985 -118.02 33.96 36 2071 398 988 404 4.6226 219700 LESS 1H OCEAN
11803 -121.29 38.89 10 653 120 407 146 3.3889 110800 INLAND
5288 -118.47 34.05 22 5215 1193 2048 1121 4.7009 500001 LESS 1H OCEAN
12386 -116.38 33.71 17 12509 2460 2737 1423 4.5556 258100 INLAND
16170 -122.55 37.79 32 2131 625 1229 572 2.9201 322200 NEAR OCEAN
13181 -117.74 33.97 4 9755 1748 4662 1583 5.6501 254900 LESS 1H OCEAN
16774 -122.48 37.69 43 2661 455 1384 456 4.2421 257500 NEAR OCEAN
17953 -121.96 37.34 37 663 127 293 132 3.7813 247800 LESS 1H OCEAN
7759 -118.13 33.91 34 916 162 552 164 4.9107 222000 LESS 1H OCEAN
15661 -122.42 37.78 26 812 507 628 445 2.3304 500001 NEAR BAY
12614 -121.52 38.51 23 6876 1456 2942 1386 3.0963 156900 INLAND
6308 -117.87 33.99 21 2837 515 2031 555 4.9271 209700 LESS 1H OCEAN
14118 -117.11 32.73 34 1096 221 574 223 3.8355 126700 NEAR OCEAN
9978 -122.48 38.54 37 1898 359 973 340 4.2096 256600 INLAND
10115 -117.94 33.94 26 1962 540 1236 520 2.2156 145000 LESS 1H OCEAN
12900 -121.34 38.63 13 3033 540 1363 519 4.0036 161700 INLAND
18384 -121.80 37.19 45 1797 303 870 281 4.5417 434500 LESS 1H OCEAN
18270 -122.07 37.36 28 4612 608 1686 567 10.0346 500001 LESS 1H OCEAN
12244 -116.99 33.75 18 9601 2401 4002 2106 1.4366 77000 INLAND
6080 -117.85 34.09 16 4556 639 2066 651 6.4667 263900 LESS 1H OCEAN
7406 -118.21 33.96 39 2265 628 2323 599 2.1522 155300 LESS 1H OCEAN
4867 -118.28 34.04 25 1582 780 2390 719 1.4167 200000 LESS 1H OCEAN
897 -121.96 37.53 23 2215 475 1278 492 4.2955 218800 LESS 1H OCEAN
3978 -118.62 34.19 35 1934 307 905 315 5.5101 267400 LESS 1H OCEAN
20366 -118.92 34.18 17 2400 352 1067 323 6.3522 259300 LESS 1H OCEAN
5473 -118.45 33.99 52 1010 244 573 242 4.1861 363200 LESS 1H OCEAN
9026 -118.74 34.05 19 3487 686 2782 584 7.9184 500001 NEAR OCEAN
10069 -120.17 39.33 18 1046 204 486 179 4.1190 110900 INLAND
17627 -121.94 37.25 16 3942 749 1894 737 5.2894 332800 LESS 1H OCEAN
15920 -122.42 37.73 52 3230 654 1765 611 3.3333 292300 NEAR BAY
8077 -118.19 33.83 30 2246 552 1032 548 3.5871 347100 NEAR OCEAN
10765 -117.91 33.63 20 3442 1526 1427 977 3.1985 106300 LESS 1H OCEAN
11028 -117.83 33.80 31 2016 409 1095 405 3.8681 196000 LESS 1H OCEAN
14232 -117.04 32.68 9 3087 609 1530 556 3.7750 125000 NEAR OCEAN
14322 -117.15 32.71 52 402 183 557 172 1.3125 87500 NEAR OCEAN
13892 -116.32 34.10 10 4256 861 1403 686 2.6618 81000 INLAND
20068 -120.40 37.98 19 2010 433 910 390 2.6696 121200 INLAND
1646 -121.89 37.82 4 11444 1355 3898 1257 13.2949 500001 INLAND
2655 -124.23 40.54 52 2694 453 1152 435 3.0806 106700 NEAR OCEAN
18344 -122.15 37.43 47 2600 490 1149 465 5.0203 476300 NEAR BAY
12315 -116.76 33.46 6 1251 268 544 216 3.0694 173400 INLAND
10284 -117.76 33.87 16 3973 595 1971 575 6.4265 263700 LESS 1H OCEAN
3652 -118.44 34.21 41 1440 325 1014 322 2.8750 168600 LESS 1H OCEAN
12920 -121.31 38.66 27 1713 282 761 295 5.2081 136400 INLAND
3015 -119.15 34.83 6 8733 1600 2006 736 4.5724 168400 INLAND
4428 -118.25 34.07 16 719 225 801 218 2.3942 133300 LESS 1H OCEAN
4919 -118.26 34.00 41 1733 492 1776 453 1.6221 104200 LESS 1H OCEAN
14606 -117.17 32.81 33 3064 506 1355 488 4.2200 178700 NEAR OCEAN
7486 -118.21 33.92 37 1705 403 1839 410 2.5833 132700 LESS 1H OCEAN
9774 -121.25 36.32 12 4776 1082 4601 1066 2.9184 100500 LESS 1H OCEAN
16333 -121.34 38.03 20 4213 751 2071 714 4.4063 130800 INLAND
9438 -119.95 37.47 32 1312 315 600 265 1.5000 91500 INLAND
16529 -121.17 37.88 22 1283 256 3082 239 3.5365 111800 INLAND
5504 -118.43 33.99 45 1899 461 1260 415 2.6667 320000 LESS 1H OCEAN
14496 -117.20 32.85 26 2298 549 980 555 2.4207 213500 NEAR OCEAN
13375 -117.51 34.16 2 718 98 119 50 4.1000 315000 INLAND
5156 -118.26 33.96 37 1625 383 1243 350 1.3971 89800 LESS 1H OCEAN
1294 -121.78 38.00 8 2371 375 1094 396 5.3245 174500 INLAND
18343 -122.14 37.43 52 1383 227 551 249 6.5829 500001 NEAR BAY
10205 -117.93 33.87 52 950 229 429 185 2.3150 182100 LESS 1H OCEAN
12291 -116.97 33.94 29 3197 632 1722 603 3.0432 91200 INLAND
17179 -122.47 37.51 15 4974 764 2222 774 6.7606 364300 NEAR OCEAN
13598 -117.28 34.09 44 376 NA 273 107 2.2917 90800 INLAND
7084 -118.01 33.93 31 3395 742 1886 737 4.4118 174400 LESS 1H OCEAN
8198 -118.14 33.80 43 2506 531 1230 543 3.4211 203900 NEAR OCEAN
3226 -119.66 36.30 18 1147 202 717 212 3.3681 70500 INLAND
13923 -114.94 34.55 20 350 95 119 58 1.6250 50000 INLAND
17003 -122.25 37.56 19 7976 1406 3437 1338 5.6396 430300 NEAR BAY
5580 -118.30 33.83 31 2693 661 1598 618 3.1851 240200 LESS 1H OCEAN
6979 -118.03 33.97 32 2468 552 1190 479 3.8275 238500 LESS 1H OCEAN

Aim: develop model to predict house prices

Examine the data (note: observations are at the block level, not individual houses)

Missing features

Handling missing data: options

  • drop rows with missing obs.
  • drop columns (i.e. variables) with lots of missing obs.
  • impute observations

Imputation

  • lots of options
  • simple method: use median (for continuous data) or mode (for categorical / ordinal)
  • model dependencies: regress each variable with missing values on the other variables and impute from the fit
  • models incorporating uncertainty: useful if a substantial portion of the data is missing
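
As a sketch, the simple median/mode method in pandas (toy values borrowed from the table above; column names follow the California housing data):

```python
import numpy as np
import pandas as pd

# Toy block-level rows with missing values in each column type
df = pd.DataFrame({
    "total_bedrooms": [708.0, np.nan, 309.0, 624.0],        # continuous
    "ocean_proximity": ["INLAND", "NEAR BAY", None, "INLAND"],  # categorical
})

# Median for continuous data, mode for categorical data
df["total_bedrooms"] = df["total_bedrooms"].fillna(df["total_bedrooms"].median())
df["ocean_proximity"] = df["ocean_proximity"].fillna(df["ocean_proximity"].mode()[0])
```

Scikit-learn's `SimpleImputer` wraps the same logic so it can sit inside a pipeline.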

How to handle categorical data?

Median income by ocean proximity

One-hot encoding

index ocean_proximity coast island
10283 LESS 1H OCEAN 1 0
11497 LESS 1H OCEAN 1 0
6965 LESS 1H OCEAN 1 0
11645 LESS 1H OCEAN 1 0
2012 INLAND 0 0
13278 INLAND 0 0
65 NEAR BAY 1 0
17104 NEAR OCEAN 1 0
13807 INLAND 0 0
9796 NEAR OCEAN 1 0
13346 INLAND 0 0
7754 LESS 1H OCEAN 1 0
2750 INLAND 0 0
4461 LESS 1H OCEAN 1 0
776 NEAR BAY 1 0
13635 INLAND 0 0
8276 NEAR OCEAN 1 0
20237 NEAR OCEAN 1 0
493 NEAR BAY 1 0
11533 LESS 1H OCEAN 1 0
11526 LESS 1H OCEAN 1 0
11907 INLAND 0 0
14710 NEAR OCEAN 1 0
17240 LESS 1H OCEAN 1 0
10610 LESS 1H OCEAN 1 0
18813 INLAND 0 0
5344 LESS 1H OCEAN 1 0
6068 LESS 1H OCEAN 1 0
20396 LESS 1H OCEAN 1 0
7075 LESS 1H OCEAN 1 0
6985 LESS 1H OCEAN 1 0
11803 INLAND 0 0
5288 LESS 1H OCEAN 1 0
12386 INLAND 0 0
16170 NEAR OCEAN 1 0
13181 LESS 1H OCEAN 1 0
16774 NEAR OCEAN 1 0
17953 LESS 1H OCEAN 1 0
7759 LESS 1H OCEAN 1 0
15661 NEAR BAY 1 0
12614 INLAND 0 0
6308 LESS 1H OCEAN 1 0
14118 NEAR OCEAN 1 0
9978 INLAND 0 0
10115 LESS 1H OCEAN 1 0
12900 INLAND 0 0
18384 LESS 1H OCEAN 1 0
18270 LESS 1H OCEAN 1 0
12244 INLAND 0 0
6080 LESS 1H OCEAN 1 0
7406 LESS 1H OCEAN 1 0
4867 LESS 1H OCEAN 1 0
897 LESS 1H OCEAN 1 0
3978 LESS 1H OCEAN 1 0
20366 LESS 1H OCEAN 1 0
5473 LESS 1H OCEAN 1 0
9026 NEAR OCEAN 1 0
10069 INLAND 0 0
17627 LESS 1H OCEAN 1 0
15920 NEAR BAY 1 0
8077 NEAR OCEAN 1 0
10765 LESS 1H OCEAN 1 0
11028 LESS 1H OCEAN 1 0
14232 NEAR OCEAN 1 0
14322 NEAR OCEAN 1 0
13892 INLAND 0 0
20068 INLAND 0 0
1646 INLAND 0 0
2655 NEAR OCEAN 1 0
18344 NEAR BAY 1 0
12315 INLAND 0 0
10284 LESS 1H OCEAN 1 0
3652 LESS 1H OCEAN 1 0
12920 INLAND 0 0
3015 INLAND 0 0
4428 LESS 1H OCEAN 1 0
4919 LESS 1H OCEAN 1 0
14606 NEAR OCEAN 1 0
7486 LESS 1H OCEAN 1 0
9774 LESS 1H OCEAN 1 0
16333 INLAND 0 0
9438 INLAND 0 0
16529 INLAND 0 0
5504 LESS 1H OCEAN 1 0
14496 NEAR OCEAN 1 0
13375 INLAND 0 0
5156 LESS 1H OCEAN 1 0
1294 INLAND 0 0
18343 NEAR BAY 1 0
10205 LESS 1H OCEAN 1 0
12291 INLAND 0 0
17179 NEAR OCEAN 1 0
13598 INLAND 0 0
7084 LESS 1H OCEAN 1 0
8198 NEAR OCEAN 1 0
3226 INLAND 0 0
13923 INLAND 0 0
17003 NEAR BAY 1 0
5580 LESS 1H OCEAN 1 0
6979 LESS 1H OCEAN 1 0
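
A minimal sketch with pandas: `get_dummies` builds one 0/1 indicator column per category, while the `coast` column in the table above is a coarser hand-built binary (reproduced here for comparison):

```python
import pandas as pd

s = pd.Series(["LESS 1H OCEAN", "INLAND", "NEAR BAY", "NEAR OCEAN", "INLAND"],
              name="ocean_proximity")

# Full one-hot encoding: one indicator column per category
dummies = pd.get_dummies(s)

# Coarser derived binary, as in the table: 1 for any non-inland category
coast = (s != "INLAND").astype(int)
```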

House price versus continuous attributes

Income vs house price

Handling censored observations

  • leave them as they are
  • or could drop censored observations altogether
  • (better) impute them
  • (better still) explicitly model uncertainty in them

Feature creation: rooms per household

Bedrooms per room

Persons per house
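
The three derived features above are simple ratios of the raw columns; a sketch with pandas, using values from the first two rows of the table:

```python
import pandas as pd

df = pd.DataFrame({
    "total_rooms": [3471, 1483],
    "total_bedrooms": [708, 251],
    "population": [1769, 738],
    "households": [691, 235],
})

df["rooms_per_household"] = df["total_rooms"] / df["households"]
df["bedrooms_per_room"] = df["total_bedrooms"] / df["total_rooms"]
df["persons_per_house"] = df["population"] / df["households"]
```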

Feature scaling

  • ML algorithm training is much improved if features are on similar scales
  • rescaled features can also better represent relationships between variables

House price rescaling (ignoring upper limits)

Option: standardise

Subtract the mean and divide by the standard deviation

Option: normalise

Subtract the minimum and divide by the range, rescaling to \([0, 1]\)

Option: log transform

Take logs to compress a long right tail
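
All three options are a few lines of NumPy (house prices from the table used as toy input):

```python
import numpy as np

x = np.array([62100.0, 121300.0, 246100.0, 271400.0, 500001.0])

standardised = (x - x.mean()) / x.std()           # zero mean, unit sd
normalised = (x - x.min()) / (x.max() - x.min())  # rescaled to [0, 1]
logged = np.log(x)                                # compresses the right tail
```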

Feature scaling: no golden rule

  • graph features before/after scaling: outliers can seriously affect scaling
  • build feature scaling options into ML pipeline
  • evaluate impact of different scaling options on test set prediction

Importance of pipelines

  • many options for imputation, feature creation, feature scaling
  • ML software (for example, Scikit-learn in Python) has good support for building data-processing pipelines that incorporate these steps
  • pipelines make it easy to cleanly (and without error) process data
  • and to test out different options
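
For instance, a minimal Scikit-learn pipeline chaining median imputation and standardisation (toy data; which steps and options to use would depend on the problem):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill NaNs with column medians
    ("scale", StandardScaler()),                   # then standardise each column
])

X = np.array([[1.0, 200.0],
              [np.nan, 400.0],
              [3.0, np.nan]])
X_out = pipe.fit_transform(X)
```

Swapping a step (say, `StandardScaler` for `MinMaxScaler`) then takes one line, which makes testing scaling options cheap.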

Questions?

Metrics

Metrics for regression

most common: root mean squared error

\[\begin{equation} \text{RMSE} = \sqrt{\frac{1}{K}\sum_{i=1}^{K} (\hat{y}_i - y_i)^2} \end{equation}\]

also used:

\[\begin{equation} R^2 = \frac{\text{variation explained by model}}{\text{total variation}} \end{equation}\]
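
Both metrics are a few lines of NumPy; here \(R^2\) is computed as one minus the ratio of residual to total sum of squares, matching the "variation explained" definition above (toy values):

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0])      # observed values
y_hat = np.array([2.5, 5.0, 8.0])  # model predictions

rmse = np.sqrt(np.mean((y_hat - y) ** 2))

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
```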

Metrics for binary classification: confusion matrix

True positive and false positive rate

Classification boundaries

many classifiers output class probabilities, for example

\[\begin{equation} \text{Pr}(y_i=\text{cat}|x_i) = 0.4 \end{equation}\]

an obvious choice of classification boundary is \(\text{Pr}(y_i=\text{cat}|x_i) = 0.5\), but this works poorly for imbalanced datasets; we want good performance across all boundaries

Classification boundaries

vary the boundary cutoff value and calculate the true positive rate (TPR) and false positive rate (FPR) at each value.

When the boundary is \(\text{Pr}(y_i=\text{cat}|x_i) = 0.0\) \(\implies\) everything is classified positive:

  • \(\text{TPR}=\frac{\text{TP}}{\text{TP}+\text{FN}} = \frac{\text{TP}}{\text{TP} + 0} = 1\)
  • \(\text{FPR}=\frac{\text{FP}}{\text{FP}+\text{TN}} = \frac{\text{FP}}{\text{FP} + 0} = 1\)

When the boundary is \(\text{Pr}(y_i=\text{cat}|x_i) = 1.0\) \(\implies\) everything is classified negative:

  • \(\text{TPR}=\frac{\text{TP}}{\text{TP}+\text{FN}} = \frac{0}{0 + \text{FN}} = 0\)
  • \(\text{FPR}=\frac{\text{FP}}{\text{FP}+\text{TN}} = \frac{0}{0 + \text{TN}} = 0\)
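
A sketch of the threshold sweep, assuming toy probabilities and labels; the two extreme boundaries recover the \((1, 1)\) and \((0, 0)\) cases above:

```python
import numpy as np

p = np.array([0.9, 0.7, 0.4, 0.2])  # predicted Pr(y_i = cat | x_i)
y = np.array([1, 1, 0, 0])          # true labels (1 = cat)

def rates(threshold):
    """TPR and FPR when classifying positive iff p >= threshold."""
    pred = (p >= threshold).astype(int)
    tp = np.sum((pred == 1) & (y == 1))
    fn = np.sum((pred == 0) & (y == 1))
    fp = np.sum((pred == 1) & (y == 0))
    tn = np.sum((pred == 0) & (y == 0))
    return tp / (tp + fn), fp / (fp + tn)
```

Sweeping the threshold from 0 to 1 and plotting FPR against TPR traces out the ROC curve.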

Random classifier: ignores the features and (here) assigns \(\text{Pr}(\text{cat}) = 1/4\) to every observation

ROC and AUC: random classifier

Cross-validation

Applying a gradient boosted machine

  • use a gradient boosted machine (an ensemble of decision trees) to predict house price
  • use lots of deep trees
  • (on a subset of the full data)
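
A minimal version with Scikit-learn's `GradientBoostingRegressor`, on synthetic data rather than the housing data, just to show the API:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 2 * X[:, 0] + X[:, 1] + rng.normal(0, 0.1, size=200)

gbm = GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=0)
gbm.fit(X, y)

# In-sample R^2: optimistic, which is why we evaluate on independent data
score = gbm.score(X, y)
```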

Result

Performance of model on independent data

How to tune hyperparameters?

  • more complex models \(\implies\) explain more variation in a given dataset
  • some variation is idiosyncratic nuisance variation \(\implies\) a complex model fits noise, not signal
  • instead, fit the model on a training set and evaluate performance on a separate cross-validation set
  • use cross-validation set performance to determine optimal hyperparameters

Train and cross-validation set performance

Cross-validation approaches

  • using a single train / CV split risks tuning to the noise in the CV set
  • solution: use many train / CV splits over subsets of your data
  • example: k-fold cross-validation
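
In Scikit-learn, `cross_val_score` handles the splitting and refitting; a sketch on synthetic regression data with `cv=3` folds:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(90, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, size=90)

# One R^2 score per fold; each fold serves as the CV set exactly once
scores = cross_val_score(LinearRegression(), X, y, cv=3)
```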

3-fold cross-validation

How many folds?

How to assess predictive accuracy?

  • could state predictive accuracy as that achieved on the CV set \(\implies\) an overestimate
  • instead, hold aside a separate test set on which the model is evaluated only once
  • make the test set similar to the task the model will eventually be used for
  • typically something like 70%–90% train + CV, with the remainder as test
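
A sketch of the split, assuming an 80/20 division with Scikit-learn's `train_test_split`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# 80% for train + CV; 20% held aside as a test set, evaluated only once
X_trcv, X_test, y_trcv, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
```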

Workflow

Questions?

Understanding ML prediction

Data visualisation

  • in statistics in general no important result should come as a (total) surprise
  • especially true in ML
  • should know your problem well enough before starting ML proper
  • \(\implies\) data visualisation key
  • (also building simpler more understandable models first can help)

Variable importance

  • after fitting a model \(\implies\) good to know which variables drive performance
  • straightforward for simple models like linear and logistic regression
  • harder for more black-box models, but methods exist
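
One such method is the impurity-based feature importances of a random forest; a sketch on synthetic data where the first feature dominates by construction:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 3))
y = 5 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(0, 0.1, size=300)

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Importances sum to 1; larger values indicate more influential features
importances = rf.feature_importances_
```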

Example: RF variable importance

Practical things

Which model to choose when

  • simpler models can be useful to guide complex models
  • use performance on validation set to guide model choice
  • for rectangular data \(\implies\) RFs and gradient boosted models like XGBoost tend to work best
  • for data with richer structure \(\implies\) deep learning

Golden rules of supervised ML

  1. understand your data: visualise
  2. pre-process data and design useful features; build pipelines
  3. determine what is a good baseline accuracy and, if possible, target accuracy
  4. make sure your CV and test sets are constructed consistently and reflect the eventual task
  5. know how your ML model works; what its hyperparameters mean
  6. choose hyperparameters via a grid search
  7. examine literature for domain level model choice and hyperparameter choice
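
The grid search in rule 6 can be done with Scikit-learn's `GridSearchCV`, which runs k-fold CV for every combination in the grid (synthetic data; the grid values here are purely illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(2)
X = rng.uniform(size=(120, 2))
y = X[:, 0] + rng.normal(0, 0.05, size=120)

grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}
search = GridSearchCV(GradientBoostingRegressor(random_state=0), grid, cv=3)
search.fit(X, y)

best = search.best_params_  # hyperparameters with the best mean CV score
```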

Questions?